Probing Gender Bias in Machine Learning 


Ali Soliman
ali.soliman@rwth-aachen.de
RWTH Aachen University

Ann Kliemsch
ann.kliemsch@rwth-aachen.de
RWTH Aachen University


Abstract


Digital products are no longer a mere commodity; they have become an
integral part of how we live our lives. These products have recently
harnessed machine learning techniques, which makes them powerful
tools. But machine learning models tend to carry on the biases of their
creators. Along with the rise of machine learning models in our lives,
there is a growing interest in probing these models for how fair they
are. In this study, we examine one of these biases, namely gender bias.
We aim to shed light on how, in eCommerce settings, the category of a
product might leak information about the gender of its buyers and to
what extent this effect can be measured.


CCS Concepts: • Information systems → Machine learning; Text mining.


Keywords: fairness, neural networks, gender bias


1 Introduction


The task of identifying the author's gender from a text raises the
question of whether women and men write differently. Research indicates
that women indeed write in a more "involved" style, while men write in
a more "informative" style [2]. Classifiers designed to determine
author gender often also rely on content information, taking advantage
of men's and women's tendency to write about different topics. Thus,
two main factors can be used to classify texts by author gender:
writing style and content. While content describes what is written
about, writing style describes how it is written, regardless of the
content. A much-used example of language style in the context of
business vs. casual style is: "Let us arrange a meeting for Tuesday"
and "Let's hang out on Tuesday" [3]. In a metaphorical sense, different
languages can also be understood as styles, and translation into
another language as style transfer.

While machine learning models already deliver fairly good results in
classifying author gender, humans tend to stereotype texts based on
their content. This raises the question of to what extent machine
learning models also stereotype based on content. To investigate this,
we train multiple author gender classifiers on the Amazon review
dataset. The Amazon dataset contains product reviews from different
categories, such as Electronics, Clothing, or Books. Our specific
research questions are whether and to what extent machine learning
models pick up on the stereotypes and classify reviews based on
category keywords, and to what extent the categories influence the
writing style.

Our results show that a simple LSTM already reaches an accuracy of 78%
on the unbalanced categories Electronics and Clothing. As we obfuscate
the content and balance the dataset, the accuracy decreases; however,
despite a male bias in classifying author gender, we can show that our
model also detects a content-independent writing style. When
classifying, we only consider the review text and no other metadata.
Since the review text is not labeled by gender, we determine the gender
of users based on their usernames. We only use the reviews of users for
whom a gender classification is possible with high probability.


2 Dataset & Preprocessing 


The reviews that we are considering are taken from the 
Amazon Review dataset [4]. A total of 223.1 million reviews 
from the years 1996 to 2014 are available. 

Each review has the following features: the user ID of the reviewer
and their username, the ASIN (identification number for Amazon
products) of the reviewed product, the rating in stars, the review
text, the helpfulness score, whether the purchase was verified by
Amazon (the product was purchased through Amazon), the time at which
the review was submitted, and other information that we do not take
into account. In addition, metadata is provided that contains
information about the products. Since the Amazon review dataset is
very large, we used the 5-core datasets, which are divided by category.
The 5-core datasets only include users and items that have at least 5
reviews.


2.1 Labeling 


We label the data based on the username. For this, the first run of
letters is extracted from the username, e.g. April Brooks -> April,
23Summer Girl -> Summer. Names containing '&' or 'and' are excluded
because they indicate a shared account. However, it cannot be ruled out
that the remaining accounts are also used by more than one person.

We then used the gender-guesser package for Python and additionally
wrote our own gender guesser. For the latter, we used the publicly
available database of the Social Security Administration, which
provides lists of names with gender for the years 1880-2019
[https://www.ssa.gov/oact/babynames/background.html]. For the
assignment of reviewer names, we used only the lists from 1900 to 2005,
since only these years are relevant for the reviewers. To assign a
general gender to each name, we aggregated the lists and calculated the
probability for each gender. The gender guesser takes two thresholds:
the minimum probability with which a name was assigned to boys or to
girls, and the minimum number of times the name occurred. Nevertheless,
there are names like Truth, Trust, etc., that are female as names but
can be used with a different meaning by a reviewer.


Table 1. Example Word Counts


Word      Clothing  Electronics  min_error

software         6         7239    0.00083
tv              37         5832    0.0063
computer       281        23541    0.0118
iPad           181        11972    0.01489
I           251435       288756    0.46546
excited       1635         1635    0.5
like         72311        69457    0.48993
cute         24077         2556    0.09597
dress        16149           56    0.00346
Shirt          980            3    0.00305
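
The username-based labeling of Section 2.1 can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the code used for the experiments: the record list `SSA_RECORDS` and its counts are made up, the function names and the default thresholds `min_prob` and `min_count` are hypothetical, and matching 'and' only as a whole word is our own choice.

```python
import re
from collections import defaultdict

# Hypothetical (name, sex, count) records in the style of the SSA name lists.
# The counts are invented for illustration only.
SSA_RECORDS = [
    ("April", "F", 9000), ("April", "M", 30),
    ("Summer", "F", 5000), ("Summer", "M", 20),
    ("Jordan", "F", 4000), ("Jordan", "M", 6000),
    ("James", "F", 50), ("James", "M", 20000),
]


def first_name(username):
    """Return the first run of letters, or None for likely shared accounts."""
    # Assumption: 'and' is matched as a whole word, so "Sandy" is kept.
    if "&" in username or re.search(r"\band\b", username, flags=re.IGNORECASE):
        return None
    match = re.search(r"[A-Za-z]+", username)
    return match.group(0).capitalize() if match else None


def build_counts(records):
    """Aggregate the per-year records into total counts per name and sex."""
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, count in records:
        counts[name][sex] += count
    return counts


def guess_gender(name, counts, min_prob=0.9, min_count=100):
    """Label a name only if it is frequent enough and clearly gendered."""
    if name is None or name not in counts:
        return None
    female, male = counts[name]["F"], counts[name]["M"]
    total = female + male
    if total < min_count:
        return None
    if female / total >= min_prob:
        return "female"
    if male / total >= min_prob:
        return "male"
    return None  # ambiguous names stay unlabeled


counts = build_counts(SSA_RECORDS)
for user in ["April Brooks", "23Summer Girl", "Tom & Mary", "Jordan K."]:
    print(user, "->", guess_gender(first_name(user), counts))
```

Raising `min_prob` or `min_count` trades dataset size for label quality, which matches the decision to keep only users whose gender can be assigned with high probability.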


2.2 Keyword Extraction 


We define category keywords, in the context of multiple categories, as
words that indicate that a review stems from a specific category. For
each category and each word, we calculate the conditional probability
that a review originates from that category, given that the word occurs
at least once in the review. If the word w occurs in a review, the
probability that the review belongs to category C is then given by

\[
P(\text{reviewCategory} = C \mid w).
\]

As an example, we extract the category keywords, and then remove them
from the reviews, for the categories Electronics (a male-dominated
category) and Clothing (a female-dominated category). For the category
Clothing and the word dress, the probability is

\[
P(\text{Clothing} \mid \text{dress})
  = \frac{P(\text{dress and Clothing})}{P(\text{dress})}
  = \frac{\text{dressCount in Clothing reviews}}{\text{dressCount in reviews}}
  = \frac{16{,}149}{16{,}149 + 56} \approx 0.997
\]

The numbers are given in Table 1.

When predicting the category Clothing for all reviews containing the
word dress, the error would be 0.003. Since we are interested in
removing words that indicate the category with high probability, we use
a minimum error (min_error): words with a lower error are considered
category keywords. The min_error allows us to test for varying degrees
of category keyword strength, i.e. to investigate the effects that the
choice of minimum error has on category prediction and gender
prediction. With this approach to defining category keywords alone, we
cannot exclude gender-specific words, that is, words that are used
differently by the genders in the categories. Therefore, we initially
split the dataset by gender and compute the errors for the genders
separately. We then define a category keyword as a word whose error
lies below the min_error for both genders. For the rest of the paper,
we also use a normalized min_error in the interval [0,1].


Table 2. Accuracy Post-Keyword Removal


min_error   Accuracy

Baseline    0.919562
0.2         0.842319
0.5         0.743375
0.8         0.620112
0.9         0.514000
1.0         0.494681
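
The keyword definition of Section 2.2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the original implementation: tokenization by whitespace, the function names, the toy reviews, the treatment of words unseen for one gender (error 0.5), and the normalization by a factor of 2 (the raw error of a two-category problem is at most 0.5) are all our own assumptions.

```python
from collections import Counter


def review_counts(reviews):
    """Count, per word, the number of reviews containing it at least once."""
    counts = Counter()
    for text in reviews:
        counts.update(set(text.split()))
    return counts


def word_error(word, counts_a, counts_b):
    """Error made when always predicting the majority category for a word."""
    a, b = counts_a[word], counts_b[word]
    return min(a, b) / (a + b) if a + b else 0.5


def category_keywords(data, min_error, normalized=True):
    """data maps gender -> category -> list of review texts (two categories).

    A word is a keyword if its error is below the threshold for every gender.
    """
    # Assumption: the normalized min_error in [0, 1] is twice the raw error.
    threshold = min_error / 2 if normalized else min_error
    per_gender = []
    for by_category in data.values():
        cat_a, cat_b = (review_counts(r) for r in by_category.values())
        vocab = set(cat_a) | set(cat_b)
        per_gender.append({w: word_error(w, cat_a, cat_b) for w in vocab})
    vocab = set().union(*per_gender)
    # Words unseen for one gender get the uninformative error 0.5.
    return {w for w in vocab
            if all(errors.get(w, 0.5) < threshold for errors in per_gender)}


# Toy data, invented for illustration.
data = {
    "female": {
        "Clothing": ["cute dress", "love this dress", "nice shirt"],
        "Electronics": ["love this tv", "nice computer"],
    },
    "male": {
        "Clothing": ["good dress for my wife", "nice shirt"],
        "Electronics": ["good tv", "nice software"],
    },
}
print(sorted(category_keywords(data, min_error=0.2)))

# The 'dress' row of Table 1: 56 / (16149 + 56).
print(round(word_error("dress", Counter(dress=16149), Counter(dress=56)), 5))
```

Requiring the error to be small for both genders keeps words such as cute, which in this toy data occur for one gender only, out of the keyword set.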


3 Analysis 
3.1 Category Prediction 


The initial goal is to create a machine learning model that can
adequately detect the review category from the review text; we start
with a vanilla LSTM. Surprisingly, without any tuning or usage of
external libraries, we obtained a baseline accuracy of 92%. This result
gives us a benchmark to compare against when retraining the model on
review text with a reduced vocabulary, obtained by removing
category-related keywords as explained in section 2.2.

The value min_error controls how aggressively the process removes words
that appear relatively more often in the review text of a particular
category (see section 2.2). We use five values, min_error = {0.2, 0.5,
0.8, 0.9, 1.0}, to relate the aggressiveness of the removal to the drop
in accuracy. We observed that the process produces a consistent
decrease in accuracy, and for min_error >= 0.9 the accuracy of the
model is close to chance level (Table 2).

The results indicate that the process we used was effective at
identifying the category-related keywords that helped the model detect
the review categories. Still, we wanted to get a deeper understanding
of how it compares to randomly removing keywords. So we set up a second
experiment: on the one hand, we use the process defined earlier; on the
other hand, we remove random words while keeping the number of words in
the vocabulary equal to that resulting from the original process. We
then look at the effect on the accuracy for each category, and we saw
interesting results. The random removal of keywords in Figure 2 is
biased towards a particular category (e.g. Electronics) while not
affecting the accuracy of the other category (e.g. Clothing). In
contrast, the process we followed in Figure 1 decreases the accuracy of
predicting both categories more consistently and substantially more
fairly (i.e., it removes keywords in a way that hinders the model's
ability to detect the category for both categories alike).
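
The random-removal baseline described above can be sketched as follows. This is a minimal illustration, not the original code: the function names, the fixed `seed`, and the toy vocabulary are our own assumptions. The key point is that the random set has the same size as the keyword set, so both variants leave a vocabulary of the same size.

```python
import random


def remove_words(reviews, removed):
    """Drop every occurrence of the given words from whitespace-split reviews."""
    return [" ".join(w for w in text.split() if w not in removed)
            for text in reviews]


def random_removal_set(vocabulary, n_remove, seed=0):
    """Pick n_remove random words, matching the size of the keyword set."""
    rng = random.Random(seed)
    # Sorting makes the draw reproducible regardless of set ordering.
    return set(rng.sample(sorted(vocabulary), n_remove))


# Toy example, invented for illustration.
vocabulary = {"cute", "dress", "good", "tv", "nice", "shirt"}
keywords = {"dress", "tv"}                      # e.g. from the min_error rule
random_words = random_removal_set(vocabulary, len(keywords))

reviews = ["cute dress", "good tv", "nice shirt"]
print(remove_words(reviews, keywords))          # keyword removal
print(remove_words(reviews, random_words))      # size-matched random removal
```

Training the same model on both outputs isolates the effect of which words are removed from the effect of how many are removed.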


[Figure: line plot "Predicting Category - Little Preprocessing". x-axis:
number of words in vocabulary (removal of keywords), 4000 to 16000;
y-axis: accuracy, 0.5 to 0.9; curves: both categories, Clothing,
Electronics.]

Figure 1. Effect of Keyword Removal Process on Accuracy of Categories
Prediction


[Figure: line plot "Predicting Category". x-axis: number of words in
vocabulary (random removal), 4000 to 16000; y-axis: accuracy, 0.5 to
0.9; curves: both categories, Clothing, Electronics.]

Figure 2. Effect of Random Keyword Removal Process on Accuracy of
Categories Prediction


3.2 Gender Prediction 


To get a better understanding of the data, we started with a simple
logistic regression model. We used the 16 categories with the most
reviews and balanced the resulting dataset by category and gender. The
categories are Automotive, Books, CDs, Cell, Clothing, Electronics,
Grocery, Home, Kindle, Movies, Office, Patio, Pet, Sports, Tools, and
Toys, with a total of 800 000 reviews (50000 per category and gender).
To embed the reviews, we applied only light preprocessing and computed
the TF-IDF scores for the 2500 most frequent words from the
intersection of the vocabularies of all considered categories; these
scores form the feature matrix. This gave us an accuracy of 61.7%
across all categories, see Figure 3.


[Figure: confusion matrix. Rows: true label (female, male); columns:
predicted label (female, male).]

Figure 3. Confusion matrix for Logistic Regression on all categories
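
The data preparation for the logistic regression model (balancing by category and gender, then restricting the vocabulary to frequent words shared by all categories) can be sketched in plain Python. This is a minimal sketch under stated assumptions: the tuple layout `(category, gender, text)`, the function names, whitespace tokenization, and the toy reviews are hypothetical, and the TF-IDF weighting and the classifier itself are left to a library of choice.

```python
import random
from collections import Counter, defaultdict


def balance(reviews, per_group, seed=0):
    """Sample per_group reviews from every (category, gender) cell."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for category, gender, text in reviews:
        groups[(category, gender)].append(text)
    balanced = []
    for (category, gender), texts in sorted(groups.items()):
        if len(texts) < per_group:
            raise ValueError(f"too few reviews for {category}/{gender}")
        balanced += [(category, gender, t) for t in rng.sample(texts, per_group)]
    return balanced


def shared_vocabulary(reviews, top_k):
    """Most frequent words among those occurring in every category."""
    per_category = defaultdict(Counter)
    for category, _gender, text in reviews:
        per_category[category].update(text.split())
    shared = set.intersection(*(set(c) for c in per_category.values()))
    total = Counter()
    for counts in per_category.values():
        total.update({w: n for w, n in counts.items() if w in shared})
    return [w for w, _ in total.most_common(top_k)]


# Toy reviews, invented for illustration.
reviews = [
    ("Books", "female", "I love this book"),
    ("Books", "female", "great book"),
    ("Books", "male", "this book is long"),
    ("Tools", "female", "I love this drill"),
    ("Tools", "male", "this drill is loud"),
]
sample = balance(reviews, per_group=1)
print(shared_vocabulary(sample, top_k=2500))
```

Taking the intersection drops words such as book or drill that occur in a single category only, which already removes many category-specific words before any model is trained.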


By ordering the words by their Logistic Regression coefficients, we
determined the tokens the model considers typically masculine and
feminine. The 15 most typical female tokens are husband, I, ., so, her,
and, because, He, !, love, my, beautiful, we, for, This. The 15 most
typical masculine tokens are wife, \n, similar, while, far, likely,
link, may, check, if, issue, version, years, why, expect.

Many keywords had already been removed from the training data by
forming the intersection of words across the categories and were thus
not considered in the training of the model. Still, it is interesting
that many of the most highly weighted male and female words seem to
have little to do with categories. This suggests that the model did not
predict gender based on categories alone. That words like I and my have
very high coefficients is consistent with the results of [1]. A
stereotype that the model also found immediately is that women have and
write about husbands, and men about wives. Overall, only about 2% of
the reviews in the training dataset contain words like husband, wife,
boyfriend, or girlfriend; this was enough to assign husband and wife
the highest coefficients.

Figures 4 and 5 show the ratio of male reviews in the real dataset
across all categories, indicated by a blue or red dot, depending on
whether there are more female or male reviews in the category of the
original dataset (female- or male-dominated, respectively). The gray
dots describe the distribution of gender predictions on the test set.
It should be noted that in both the test set and the training set, the
reviews were balanced by category and gender. Ideally, the gray points
lie on the black 0.5 line, which describes the distribution of the
reviews in the training and test set.


[Figure: ratio of male reviews (y-axis, up to 1.0) per category for
Automotive, Books, CDs, Cell, Clothing, Electronics, Grocery, Home.]

Figure 4. Ratio of male reviews in the original dataset (red and blue
dots) and the prediction of the LR-model on a balanced dataset (gray
dots)


[Figure: ratio of male reviews (y-axis, 0.0 to 1.0) per category for
Kindle, Movies, Office, Patio, Pet, Sports, Tools, Toys.]

Figure 5. Ratio of male reviews in the original dataset (red and blue
dots) and the prediction of the LR-model on a balanced dataset (gray
dots)


While there seems to be a tendency for reviews from male-dominated
categories to be predicted as male rather than female by the model,
there are differences between categories. For example, reviews from the
Sports category are predicted to be male less often than reviews from
Patio, even though the Sports category is more male-dominated. The same
is true for female-dominated categories, but here the correlation seems
to be weaker. For example, the Books category is dominated by females,
but its reviews tend to be predicted as male. For the remainder of the
paper, we decided to focus mainly on the female-dominated category
Clothing and the male-dominated category Electronics, which have a
large number of reviews and for which our logistic regression model
seems to show a relatively large category bias. Again, we balanced this
remaining dataset by category and gender.


After successfully validating that removing category-related keywords
is effective, we moved to the next step in the process. It is similar
to the one we used for the category prediction: we train a vanilla LSTM
model on the corpus without preprocessing. We then apply different
thresholds of preprocessing, from the least constrained to the most
aggressive, and monitor the behavior of the retrained model.


Table 3. Gender Accuracy Post-Keyword Removal


min_error   Accuracy

Baseline    0.687619
0.2         0.658856
0.5         0.632088
0.8         0.548144
0.9         0.559356
1.0         0.548144


The model without any preprocessing reaches an accuracy of 69% on
unseen data. As in the category prediction, we then preprocess the
review text using the same min_error values {0.2, 0.5, 0.8, 0.9, 1.0}
and observe the following behavior.


1. As shown in Table 3, there is a trend of losing accuracy,
which could be explained by the fact that the review
text carries less information the more words we remove
from the vocabulary.

2. The loss of accuracy is not as steep as it was in the
category prediction; this could be explained by the
gender prediction model not relying on the removed
keywords as heavily as the category prediction
model.

3. The more aggressive the removal of category-related
keywords is, the more the model tends to be biased
towards predicting males.


[Figure: line plot. x-axis: error; y-axis: percentage; curves:
correctly-female, correctly-male.]

Figure 6. Accuracy Post-Keywords Removal per Gender
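
The per-gender quantities behind observation 3 and Figure 6 can be computed from true and predicted labels as in the following sketch. The function name, the key names, and the toy labels are our own assumptions; the sketch also assumes that both genders occur among the true labels.

```python
def per_gender_rates(true_labels, predicted_labels):
    """Share of correctly predicted reviews per gender and overall male ratio."""
    stats = {}
    for gender in ("female", "male"):
        # Predictions for all reviews whose true label is this gender.
        preds = [p for t, p in zip(true_labels, predicted_labels) if t == gender]
        stats[f"correctly-{gender}"] = sum(p == gender for p in preds) / len(preds)
    # On a balanced test set, an unbiased model would be close to 0.5 here.
    stats["predicted-male-ratio"] = (
        sum(p == "male" for p in predicted_labels) / len(predicted_labels)
    )
    return stats


# Toy labels, invented for illustration.
true_labels = ["female", "female", "male", "male"]
predicted = ["male", "female", "male", "male"]
print(per_gender_rates(true_labels, predicted))
```

A male bias shows up as a low correctly-female value together with a predicted-male ratio above 0.5, even when the overall accuracy changes little.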


3.3 Other Models 


To test whether the previous conclusions hold beyond a single
architecture, we trained two additional models, following the same
procedure as for the vanilla LSTM model: a model trained on the
unprocessed review text serves as the benchmark, and the experiment is
then repeated for the five different min_error values.


3.3.1 First model. Bidirectional LSTM, an extension of
the vanilla LSTM model; it processes sequences in
both directions, leading to an increase in performance.


3.3.2 Second model. Combination of an LSTM layer and
a CNN layer, as proposed in [7] for speech recognition. We
experiment here with a similar architecture on sequence
classification.

While we have not seen a difference in the accuracy of the base models
trained without any review preprocessing, we saw similar behavior in
the retrained models after preprocessing, both in the decline of
accuracy and in the bias toward predominantly classifying reviewers'
gender as male after heavy preprocessing.

We additionally trained a simple neural network model for which we
embedded the reviews in advance using Doc2Vec from the gensim library.
We noticed that training the embedding makes up the longest part of the
training. The accuracy is comparable to that of the LSTM model.


4 Related Work 


Argamon et al. [1] looked at fictional and nonfictional texts. They
examined over 1000 features, including function words and n-grams, and
reduced them to the most important ones, fewer than 50. Features
indicating female texts included pronouns, while features indicating
male texts were determiners, such as a, the, that, these. They found
that their results matched the observation that females tend to write
more “involved” whereas men write more “informative” [2].

Otterbacher [5] uses reviews from the Internet Movie Database for
author gender identification. Otterbacher used style, content, and
metadata to identify gender and obtained a classification accuracy of
73.7%. For writing style, Otterbacher used the features “Inner vs.
outer-focused discourse”, “Hedging”, “Complexity”, “Vocabulary
richness”, and “Lexical markers”. She also examines the metadata
“Movie popularity”, “Reviewer rating”, “Reviewer rating deviation”,
“Age”, “Length”, “Title length”, “Attention from the community”, and
“Perceived utility”. Otterbacher found that perceived utility
correlates with gender.

Prabhumoye et al. [6] aim to transfer gender, political slant, and
sentiment on sentences. In future work, they intend to remove author
attributes from the text while preserving the meaning. For the
classification of writing style, they use a CNN. As inputs, they use a
method that identifies words that indicate a particular category.


5 Conclusion 


In this paper, we discussed how machine learning models have become an
integral part of our lives and how, as a consequence, models naturally
reflect human biases. We examined one of these biases: the gender bias
that can arise from associating text with a product category. We tested
this hypothesis on data from the Amazon Review dataset. We started by
preprocessing the data for the different parts of the experiments. We
then validated that the process described for removing category-related
keywords is effective by training a machine learning model that detects
the category from the review text and applying increasingly aggressive
removal of keywords until the predictions were no better than chance.
We then re-ran the experiments on different models that detect the
gender of the reviewer; while we saw a decline in the accuracy of these
models, it was not as steep as the decline of the category prediction
model. We also found that models tend to be more biased towards
predicting males the more aggressively keywords are removed;
consequently, models become worse at detecting female reviewers.

One of the limitations we faced was that the corpus contained few
reviews in a specific cross-category (e.g. 400k female Electronics
reviewers); therefore, we had to sample the other cross-categories down
to the same size. Additionally, the experiments required retraining
different models with different variables, and more experiments could
be run to validate the hypothesis. Unfortunately, this process takes a
substantial amount of time.

We hope that this study will take the discussion around fair machine
learning models a step further. Building on the previous conclusions,
we hope that the next phase of the research can use more complex models
that have a better base accuracy, especially for gender prediction.
Also, instead of the proposed way of identifying product-related
keywords, frameworks like LIME can be used to identify the factors that
contribute to a certain prediction, which would yield a new process for
removing these keywords. We also hope to use bigger compute clusters,
so that we can take more data from the review dataset and thereby
strengthen the models' performance by exposing them to a wider variety
of data.


References 


[1] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel
Shimoni. 2003. Gender, genre, and writing style in formal written
texts. Text & Talk 23, 3 (2003), 321-346.

[2] Douglas Biber. 1995. Dimensions of Register Variation: A
Cross-Linguistic Comparison. Cambridge University Press.
https://doi.org/10.1017/CBO9780511519871

[3] Na Cheng, Rajarathnam Chandramouli, and K. Subbalakshmi. 2011.
Author gender identification from text. Digital Investigation 8 (07
2011), 78-88. https://doi.org/10.1016/j.diin.2011.04.002

[4] Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying
Recommendations using Distantly-Labeled Reviews and Fine-Grained
Aspects. 188-197. https://doi.org/10.18653/v1/D19-1018

[5] Jahna Otterbacher. 2010. Inferring Gender of Movie Reviewers:
Exploiting Writing Style, Content and Metadata. In Proceedings of the
19th ACM International Conference on Information and Knowledge
Management (Toronto, ON, Canada) (CIKM ’10). Association for Computing
Machinery, New York, NY, USA, 369-378.
https://doi.org/10.1145/1871437.1871487

[6] Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan
W. Black. 2018. Style Transfer Through Back-Translation. CoRR
abs/1804.09000 (2018). arXiv:1804.09000 http://arxiv.org/abs/1804.09000

[7] Tara N. Sainath, Oriol Vinyals, Andrew Senior, and Hasim Sak. 2015.
Convolutional, Long Short-Term Memory, fully connected Deep Neural
Networks. In 2015 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). 4580-4584.
https://doi.org/10.1109/ICASSP.2015.7178838


A Appendix 
A.1 Research Code 


We have used RWTH Aachen University’s self-hosted GitLab. You can find
all the code in this repository:
https://git.rwth-aachen.de/alisoliman/gender-stereotypes


A.2 Ali’s Contribution


Ali’s contributions were primarily invested in experimenting with
different neural networks and orchestrating a pipeline that can iterate
over models with different training variables and store the results for
further analysis. Furthermore, he has been working on capturing the
differences between the models that were predicting gender.


A.3 Ann’s Contribution 


Ann’s contribution was analyzing and extracting the keywords of the
dataset and writing a collection of functions to preprocess reviews and
embed them using different variants. These were used in testing and
plotting the prediction of gender or categories, both with the removal
of random words and of category keywords, as well as with different
levels of preprocessing.


